Abstract

This article reports the use of the BioC standard format in our sentence simplification sys- 
tem, iSimp, and demonstrates its general utility. iSimp is designed to simplify complex sen- 
tences commonly found in the biomedical text, and has been shown to improve existing 
text mining applications that rely on the analysis of sentence structures. By adopting the 
BioC format, we aim to make iSimp readily interoperable with other applications in the bio- 
medical domain. To examine the utility of iSimp in BioC, we implemented a rule-based rela- 
tion extraction system that uses iSimp as a preprocessing module and BioC for data ex- 
change. Evaluation on the training corpus of BioNLP-ST 2011 GENIA Event Extraction (GE) 
task showed that iSimp sentence simplification improved the recall by 3.2
percentage points without reducing precision. The iSimp simplification-annotated
corpora, both our previously used corpus
and the GE corpus in the current study, have been converted into the BioC format and made 
publicly available at the project's Web site: http://research.bioinformatics.udel.edu/isimp/. 
Database URL: http://research.bioinformatics.udel.edu/isimp/



Introduction 

With the accelerating growth of biomedical publications, 
biologists have difficulty in keeping up with the new find- 
ings reported in the papers. Natural language processing 
(NLP) techniques have thus been developed to process the 
biomedical texts. However, the syntactic complexity of the 
language poses a challenge in designing and applying NLP
systems. One solution is to simplify sentences before
applying NLP techniques, thus concealing the syntactic 
complexity from further NLP steps. For this purpose, we 
have previously developed iSimp (1), a sentence simplifica- 
tion system. 

iSimp can be used as a preprocessing module to provide 
simplified text to enhance the performance of NLP systems 
and text mining (TM) applications. Without a common exchange
format, integrating iSimp into wide-ranging applications would
require a customized adapter for each of them. Recently, the BioC
format has emerged as a community standard for the exchange of text
documents and annotations (2). Based on an XML format,
BioC is simple, yet robust, and well suited to iSimp's needs.

We participated in the BioCreative IV Track 1 (BioC: 
Interoperability) and adopted the BioC format in iSimp. In 
this article, we report how BioC is used with iSimp, and 
how iSimp can be integrated with various applications. 
Overall, this work makes three main contributions. 

The first contribution is the development of a BioC tag 
set for annotating simplification constructs. The tag set can 
be used in conjunction with any sentence simplification
system to exchange data with other NLP systems. The 
standard tag set also serves the purpose of comparing the 
results among different simplification systems. 

The second contribution is a mechanism of using the 
BioC framework. The proposed mechanism denotes sim- 
plified sentences in a corpus file, along with the annotation 
of simplification constructs in the original sentence. It 
allows simplified sentences to be included in the BioC an- 
notation file so that they can be processed in place of (or in 
addition to) the original text. Moreover, the annotated 
phrases within simplified sentences can be mapped back to 
the original text. This mechanism is important for two rea- 
sons. First, it ensures that the output is presented aligned 
with the original text. Second, it allows the benchmarking 
of the NLP procedure, where the outputs must be aligned 
with the gold standard annotation in the original corpus. 

The third contribution of this work is the construction 
of an iSimp corpus presented in the BioC format. The cor- 
pus, consisting of 130 Medline abstracts annotated with 
six types of simplification constructs, can be used for the 
evaluation of the simplifier. In addition to this corpus, we 
also transform the GENIA Event Extraction (GE) corpora 
of the BioNLP-ST 2011 to BioC format. The GE corpora 
were used to evaluate the impact of iSimp in relation ex- 
traction (RE) tasks. All these corpora have been made pub- 
licly available for evaluating and comparing various 
simplification systems. 

To show the wide applicability and good performance of 
iSimp, we examined its impact on the RE task. We de- 
veloped a basic rule-based RE system to recognize the BioC 
format, presented how iSimp could enhance its performance 
and showed that iSimp was seamlessly added to the RE sys- 
tem with little effort required for the system integration. 

Background 

This section introduces the concepts and related work for 
sentence simplification and the BioC framework. 



Sentence simplification 

Sentence simplification is a technique to detect various 
types of clauses and constructs contributing to the com- 
plexity of sentences, and to produce two or more simple 
sentences while maintaining both coherence and the com- 
municated message. By reducing the complexity, sentence 
simplification can ease the development of NLP/TM tools, 
as well as other tools, such as machine translation tools. 
To illustrate the usefulness of sentence simplification,
consider the following complex sentence from the biomedical
literature: 

E1. A third genetic linkage to disease is
alpha-synuclein, a protein that is
heavily phosphorylated in Lewy bodies
and Lewy neuritis, the pathological
hallmarks of PD. (PMID-22342821)

In this example, we can see coordination (e.g. 'Lewy 
bodies and Lewy neuritis'), relative clause (e.g. 'that is heav- 
ily phosphorylated in...' referring to 'a protein') and appos- 
ition (e.g. 'a protein that is...' referring to 'alpha-synuclein' 
and 'the pathological hallmarks of PD' referring to 'Lewy
bodies and Lewy neuritis'). These are major syntactic con- 
structs that contribute to the complexity of sentences. After 
identifying these constructs, the complex sentence can be 
broken into multiple simple sentences. Here we show only 
two examples, which require combining the coordination 
with the relative clause and one of the appositions: 

E2. Alpha-synuclein is heavily phosphorylated
in Lewy bodies.

E3. Alpha-synuclein is heavily phosphorylated
in Lewy neuritis.

Pattern-based or machine-learning approaches will un- 
doubtedly find E2 and E3 much easier to process for the 
extraction of information/features than the original 
sentence. 

The automatic simplification of sentences was first intro- 
duced by (3) to improve the performance of systems that 
rely on natural language input. It has been subsequently 
used in the biomedical domain. As a preprocessing module, 
syntactic simplification can be used to prune irrelevant con- 
structs from shallow parsing results (4), parse trees (5), de- 
pendency graphs (6), or to produce different versions of the 
original sentence by splitting it into multiple sentences or by
combining constituents (7-9). Also, some constructs, like 
coordination and apposition, can be assembled before pat- 
tern matching and unfolded after extraction (4, 7). 

BioC 

BioC is a framework that aims to provide an easy and 
powerful way of integrating TM tools (2). It uses an XML 
format, which enables the sharing of documents and annotations
(e.g. part-of-speech tags, named entities and entity
relations). In the BioCreative IV workshop, many NLP 
tools incorporated the BioC format. They perform tasks
such as abbreviation detection, semantic role labeling and gene
normalization (10-13). The integration of iSimp with the
BioC format is somewhat different from those cases be- 
cause iSimp produces new sentences besides the tagging of 
the original text. These new sentences may include words 
that were not in the original text. 

Materials and Methods 

In this section, we describe the methodology of iSimp and 
show how the BioC format is used to enhance the
interoperability of iSimp.

iSimp 

To make sentence simplification interoperable with other 
NLP/TM applications, we see a sentence simplifier as a 
module to be used at the beginning of NLP/TM pipelines. 
With this in mind, we designed and developed iSimp 
(http://research.bioinformatics.udel.edu/isimp). Figure 1 
shows how iSimp can be used as a module in an NLP/TM 
pipeline. It is worth pointing out that iSimp is designed to 
act like an optional plug-in. This means other applications 
are not expected to make changes to use iSimp. Instead, we 
should be able to plug iSimp in/out as needed, where the 
application can access original sentences, simplified sen- 
tences or both. 

Currently, iSimp can detect six types of simplification con- 
structs: coordination, relative clause, apposition, introduc- 
tory phrase, subordinate clause and parenthetical element. 
For a more detailed description of sentence simplification, as 
well as its challenges (e.g. attachment ambiguities, boundary 
detection and nested constructs), we refer the reader to (1). 

In comparison with the works using parse trees or de- 
pendency graphs, iSimp uses shallow parsing and recursive 
transition networks to detect all forms of simplifications. 
Figure 2 shows the workflow of the system. iSimp first 
tokenizes the text, and then it splits each sentence into a se- 
quence of nonoverlapping chunks. The detection of various 
simplification constructs is based on the chunks, and from 
these, iSimp generates simplified sentences. Three types of
chunks were investigated here: noun phrases, verb groups
and prepositional phrases.

[Figure 1 diagram: Original text in BioC, passed to iSimp, producing Simplified text in BioC]

iSimp scans the phrase sequence from left to right. 
Whenever a trigger word of a simplification construct is 
found (e.g. 'and' for coordination or 'which' for relative 
clause), we attempt to identify the simplification construct 
using transition networks. If a stop state of the network is
reached, then a simplification construct has been detected
successfully. We extended the network to address nested
constructs. For an in-depth description of this process, we
refer the reader to (1). 
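
The left-to-right scan can be illustrated with a minimal Python sketch. It is not the iSimp implementation: the chunk labels (NP, VG, CC), the function names and the single toy network for coordination are assumptions made for illustration, whereas iSimp uses recursive transition networks that also handle nested constructs (1).

```python
from typing import List, Optional, Tuple

# A chunk is (label, text). The labels NP/VG/CC are assumed for this sketch.
Chunk = Tuple[str, str]


def match_coordination(chunks: List[Chunk], i: int) -> Optional[Tuple[int, int]]:
    """Toy network for '<X> and <X>' around the trigger at index i.

    The stop state is reached only if the chunks on both sides of the
    trigger exist and share the same label (two NPs or two VGs).
    """
    if i == 0 or i + 1 >= len(chunks):
        return None
    left_label = chunks[i - 1][0]
    right_label = chunks[i + 1][0]
    if left_label in {"NP", "VG"} and left_label == right_label:
        return (i - 1, i + 1)  # inclusive chunk span of the construct
    return None


def scan(chunks: List[Chunk]) -> List[Tuple[str, Tuple[int, int]]]:
    """Scan chunks left to right and try a network at every trigger word."""
    found = []
    for i, (label, text) in enumerate(chunks):
        if label == "CC" and text == "and":  # trigger for coordination
            span = match_coordination(chunks, i)
            if span is not None:
                found.append(("coordination", span))
    return found


chunks = [
    ("NP", "Active Raf-2"),
    ("VG", "phosphorylates"),
    ("CC", "and"),
    ("VG", "activates"),
    ("NP", "MEK1"),
]
constructs = scan(chunks)
print(constructs)
```

In this sketch a failed match simply yields nothing for that trigger, which mirrors the idea that a construct is reported only when a stop state is reached.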

iSimp generates simplified sentences by combining vari- 
ous simplification constructs. To illustrate the problem, 
consider the following sentence: 

E4. Active Raf-2 [coordination phosphorylates
and activates] MEK1, [relative clause which
in turn [coordination phosphorylates and
activates] the MAP kinases signal regulated
kinases, [coordination & appositive ERK1
and ERK2]]. (PMID-8557975)

iSimp is able to generate several simple sentences from 
(E4). Five of them are shown below: 

E5. (a) Active Raf-2 phosphorylates MEK1.
(b) MEK1 in turn phosphorylates ERK1.
(c) MEK1 in turn phosphorylates ERK2.
(d) The MAP kinases signal regulated kinases is an ERK1.
(e) The MAP kinases signal regulated kinases is an ERK2.

Sometimes, iSimp will introduce new words in the sim- 
plified sentence to keep it grammatically correct. For ex- 
ample, in (E5d) and (E5e), we put 'is an' between the 
appositive clause and the singular noun phrase it refers to, 
to form a new sentence. Adding new words to the corpus is 
one of the factors that distinguish iSimp from other appli- 
cations that enhance BioC. 
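
One way to picture how combining constructs yields several simple sentences is to treat each coordination as a slot with alternatives and enumerate all combinations. The sketch below is only an illustration under that assumption; the function name and the input representation are hypothetical and do not describe how iSimp is implemented.

```python
from itertools import product
from typing import List, Sequence, Union

# A part is either fixed text or a sequence of alternative conjuncts.
Part = Union[str, Sequence[str]]


def expand_coordinations(parts: List[Part]) -> List[str]:
    """Return one simple sentence per combination of conjuncts."""
    options = [[p] if isinstance(p, str) else list(p) for p in parts]
    return [" ".join(choice) + "." for choice in product(*options)]


# The relative clause of (E4), attached to its referred noun phrase 'MEK1'.
sentences = expand_coordinations(
    ["MEK1 in turn", ("phosphorylates", "activates"), ("ERK1", "ERK2")]
)
for s in sentences:
    print(s)
```

Two coordinations with two conjuncts each give four sentences, which is why the number of simplified sentences can grow quickly for heavily coordinated text.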

iSimp in BioC format 

Because sentence simplification requires a unique schema
to add new text in the corpus, we designed a BioC tag set
for annotating and sharing the simplification results.
Figures 3 and 4 show the key files used in iSimp to define
the semantics associated with the data.

Figure 1. NLP pipeline with iSimp.

Figure 2. The workflow of iSimp.

We use the annotation element to mark up the simplifi- 
cation construct components, and we use the relation elem- 
ent to specify how these components are related. In the 
latter, we further specify the name of the simplification type 
(e.g. coordination, relative clause, etc.), as well as roles for 
each component in the relation using the node element (e.g. 
'conjunct' and 'conjunction' for the coordination, 'referred 
noun phrase' and 'appositive' for the apposition). For example,
Figure 5 shows the coordination 'phosphorylates
and activates' in BioC format. This coordination contains 
two conjuncts ('phosphorylates' and 'activates') and one 
conjunction ('and'). Some attributes, like the location elem- 
ents, are not shown in this figure for lack of space. 
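
As a rough sketch of how such a coordination could be serialized, the following Python code builds the annotation and relation elements with the standard library. The function name, the id scheme and the example offset are assumptions for illustration; the element and attribute names follow the BioC format (2) and the infon keys follow Figure 5.

```python
import xml.etree.ElementTree as ET


def build_coordination(sentence_text, sentence_offset, components):
    """Build a BioC <sentence> holding one coordination.

    components: ordered (phrase, role) pairs, e.g. ("and", "conjunction").
    Offsets are document-level, so the sentence offset is added to each
    phrase position found in the sentence text.
    """
    sentence = ET.Element("sentence")
    ET.SubElement(sentence, "offset").text = str(sentence_offset)
    ET.SubElement(sentence, "text").text = sentence_text

    nodes = []
    cursor = 0
    for i, (phrase, role) in enumerate(components):
        start = sentence_text.index(phrase, cursor)  # search left to right
        cursor = start + len(phrase)
        ann = ET.SubElement(sentence, "annotation", id=f"t{i}")
        ET.SubElement(ann, "infon", key="type").text = "simplification construct"
        ET.SubElement(
            ann,
            "location",
            offset=str(sentence_offset + start),
            length=str(len(phrase)),
        )
        ET.SubElement(ann, "text").text = phrase
        nodes.append((f"t{i}", role))

    relation = ET.SubElement(sentence, "relation", id="r0")
    ET.SubElement(relation, "infon", key="simp").text = "coordination"
    for refid, role in nodes:
        ET.SubElement(relation, "node", refid=refid, role=role)
    return sentence


sentence = build_coordination(
    "Active Raf-1 phosphorylates and activates MEK1.",
    70,  # assumed sentence offset within the document
    [("phosphorylates", "conjunct"), ("and", "conjunction"), ("activates", "conjunct")],
)
print(ET.tostring(sentence, encoding="unicode"))
```

Unlike Figure 5, the sketch writes the location elements, since downstream tools need them to align annotations with the text.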

As mentioned before, iSimp generates new simplified 
sentences. This poses an additional challenge to the inte- 
gration of the BioC format, as such cases were not directly 
addressed in the original design of BioC (2). Hence, we designed
and proposed a new way of using the BioC framework.
Figure 6 shows an example of simplified sentences in the
BioC format (left), as well as the corresponding text file
(right) with locations highlighted. As mentioned before, we 
include both original and simplified sentences in the BioC 
file. The offsets of the original sentences are the same as in 
the original text. However, the offsets of the simplified sen- 
tences start with the offset of the next character after the 
last character in the original document (offset of docu- 
ment + length of document). This new collection could 
then be treated as the input collection for the next step in 
the NLP pipeline. 
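
The offset rule can be written down in a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical, and the one-character separator between consecutive simplified sentences is an assumption, since the text only fixes where the first simplified sentence starts.

```python
def simplified_offsets(doc_offset, doc_text, simplified, separator_len=1):
    """Assign offsets to simplified sentences appended after the document.

    The first simplified sentence starts at doc_offset + len(doc_text);
    each following one starts after the previous sentence plus an assumed
    separator of separator_len characters.
    """
    offsets = []
    cursor = doc_offset + len(doc_text)
    for sentence in simplified:
        offsets.append(cursor)
        cursor += len(sentence) + separator_len
    return offsets


doc_text = "Active Raf-2 phosphorylates and activates MEK1."
simplified = [
    "Active Raf-2 phosphorylates MEK1.",
    "Active Raf-2 activates MEK1.",
]
offsets = simplified_offsets(0, doc_text, simplified)
print(offsets)
```

Because the original offsets are left untouched, gold standard annotations on the original text remain valid after the simplified sentences are appended.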

To link text in simplified sentences to that in the ori- 
ginal sentence, we introduce the 'equ' (equivalence) rela- 
tion. Figure 7 shows an example of an equivalence 
relation, in which we link 'phosphorylates' back to the ori- 
ginal sentence. This way phrases in the simplified sentences 
can be mapped back to the corresponding phrases in the 
original sentence. Equivalence relations can be used to ensure
that downstream applications recognize the duplicated
nature of such 'equivalent' phrases and do not report
the same information multiple times in the end. 
Implementation of this mechanism was feasible owing to 
the extensibility of the BioC format. 
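
A downstream application could use the 'equ' relations roughly as follows. The sketch parses a small hand-written passage and maps every annotation found in a simplified sentence back to its original annotation; the function name and the id 't57' are hypothetical, while the infon value 'equ' and the node roles follow Figures 4 and 7.

```python
import xml.etree.ElementTree as ET

PASSAGE_XML = """
<passage>
  <relation id="r2">
    <infon key="type">equ</infon>
    <node refid="t2" role="original"/>
    <node refid="t39" role="simplified"/>
    <node refid="t57" role="simplified"/>
  </relation>
</passage>
"""


def equivalence_map(passage):
    """Map each simplified annotation id to its original annotation id."""
    mapping = {}
    for relation in passage.iter("relation"):
        infon = relation.find("infon[@key='type']")
        if infon is None or infon.text != "equ":
            continue
        nodes = relation.findall("node")
        originals = [n.get("refid") for n in nodes if n.get("role") == "original"]
        if not originals:
            continue
        for node in nodes:
            if node.get("role") == "simplified":
                mapping[node.get("refid")] = originals[0]
    return mapping


passage = ET.fromstring(PASSAGE_XML)
mapping = equivalence_map(passage)

# Annotation ids on which an extractor fired, in any sentence.
hits = ["t2", "t39", "t57"]
unique_hits = {mapping.get(h, h) for h in hits}
print(unique_hits)
```

Collapsing hits onto the original annotation ids is one simple way to avoid reporting the same information several times.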



This key file defines the simplification constructs in the BioC XML file.

collection: This collection is an abstract from PubMed article.
    source: PubMed
    date: yyyymmdd. Date this example was created.
    document: this collection contains one document.
        id: PubMed Identifier (PMID)
        passage: the second sentence from the abstract
            infon type: abstract
            offset: abstract arbitrarily starts at 0.
            sentence: one sentence of the passage as determined by the opennlp
                      sentence splitter.
                offset: a document offset to where the sentence begins in the
                        passage. The sum of the passage offset and the local offset
                        within the passage.
                annotation:
                    infon type: simplification construct.
                    location: location of the annotated text.
                    text: the annotated text
                relation: there are 3 types of simplification constructions: coordination,
                          relative clause, and apposition. Each described
                          separately below
                    coordination:
                        infon type: coordination
                        node: conjunct (there should be 2 or more conjuncts) and conjunction
                    relative clause:
                        infon type: relative clause
                        node: referred noun phrase and relative clause
                    apposition:
                        infon type: apposition
                        node: referred noun phrase and appositive
                    parenthesis:
                        infon type: parenthesis
                        node: referred noun phrase and parenthesized elements
Figure 3. The key file used in iSimp to define the simplification constructs associated with the data. 
This key file defines the simplified sentences in the BioC XML file.

collection: This collection is an abstract from PubMed article (PMID 8557975).
    source: PubMed
    date: yyyymmdd. Date this example was created.
    document: this collection contains one document.
        id: PubMed Identifier (PMID)
        passage: the second sentence from the abstract
            infon type: abstract
            offset: abstract arbitrarily starts at 0.
            sentence: the first sentence is the original sentence. The following
                      sentences are simplified sentences.
                infon type: original sentence or simplified sentence
                offset: the original sentence has the same offset. The simplified
                        sentences' offsets start with passage.offset + passage.length.
                text: the original UTF-8 Unicode text as it appears in the original
                      document.
            annotation: tokens in the sentence
                infon type: token
                location: location of the annotated text.
                text: the annotated text
            relation: map tokens in the simplified sentences to tokens in the
                      original sentence.
                infon type: equ
                node role: original. Token in the original sentence
                node role: simplified. Token in the simplified sentences. There might be
                           several "node simplified", if one token in the original
                           sentence appears several times in the simplified sentences.

Figure 4. The key file used in iSimp to define the simplified sentences associated with the data. 



<sentence>
  <text>Active Raf-1 phosphorylates and activates the mitogen-activated ...</text>
  <annotation id="t0">
    <infon key="type">simplification construct</infon>
    <text>phosphorylates</text>
  </annotation>
  <annotation id="t1">
    <infon key="type">simplification construct</infon>
    <text>and</text>
  </annotation>
  <annotation id="t2">
    <infon key="type">simplification construct</infon>
    <text>activates</text>
  </annotation>
  <relation id="r0">
    <infon key="simp">coordination</infon>
    <node refid="t0" role="conjunct"/>
    <node refid="t1" role="conjunction"/>
    <node refid="t2" role="conjunct"/>
  </relation>
</sentence>



Figure 5. An example of sentence simplification annotation in BioC format. The coordination contains two conjuncts ('phosphorylates', 'activates') 
and one conjunction ('and'). Some attributes, like the location elements, are not shown for the sake of space. 



Online iSimp with BioC

For various NLP/TM applications to use sentence simplifi- 
cation, we have made iSimp available online. It adopts the 
BioC format and supports two interfaces. 



Users can submit a document in the standard BioC
format, which is described in Figure 5 of (2). The format
requires a document to be specified as a sequence of sentences
where the offsets are specified with respect to the
whole document. Given the input file, iSimp will output
the list of sentences marked with simplification constructions.
Moreover, iSimp will append the simplified sentences
to the marked input sentences, and provide this
output as a zip file for download. For displaying different
sentence simplification aspects, we have also developed a
web interface where users can provide sentences in plain
text and iSimp will output the sentences marked with
simplification constructions directly in the browser.

<passage>
  <infon key="type">abstract</infon>
  <offset>0</offset>
  ...
  <sentence>
    <infon key="type">original sentence</infon>
    <offset>70</offset>
    <text>Active Raf-1 phosphorylates and activates the
      mitogen-activated protein (MAP) kinase/extracellular
      signal-regulated ...</text>
  </sentence>
  <sentence>
    <infon key="type">simplified sentence</infon>
    <offset>325</offset>
    <text>Active Raf-1 phosphorylates MEK1.</text>
  </sentence>
  <sentence>
    <infon key="type">simplified sentence</infon>
    <offset>390</offset>
    <text>MEK1 in turn ...</text>
  </sentence>
</passage>

Abstract

TCR engagement stimulates the activation of the protein kinase Raf-1.
Active Raf-1 phosphorylates and activates the mitogen-activated
protein (MAP) kinase/extracellular signal-regulated kinase kinase 1
(MEK1), which in turn phosphorylates and activates the MAP
kinases/extracellular signal regulated kinases, ERK1 and ERK2. Raf-1
activity promotes IL-2 production in activated T lymphocytes.
Therefore, we sought to determine whether MEK1 and ERK activities
also stimulate IL-2 gene transcription. Expression of constitutively
active Raf-1 or MEK1 in Jurkat T cells enhanced the stimulation of
IL-2 promoter-driven transcription stimulated by a calcium ionophore
and PMA, and together with a calcium ionophore the expression of
each protein was sufficient to stimulate NF-AT activity. Expression of
MEK1-interfering mutants inhibited the stimulation of IL-2
promoter-driven transcription and blocked the ability of constitutively
active Ras and Raf-1 to costimulate NF-AT activity with a calcium
ionophore. Expression of the MAP kinase-specific phosphatase,
MKP-1, which blocks ERK activation, inhibited IL-2 promoter and
NF-AT-driven transcription stimulated by a calcium ionophore and
PMA, and in addition, MKP-1 neutralized the transcriptional
enhancement caused by active Raf-1 and MEK1 expression. We
conclude that the MAP kinase signal transduction pathway consisting
of Raf-1, MEK1, and ERK1 and ERK2 functions in the stimulation
IL-2 gene transcription in activated T lymphocytes.

Active Raf-1 phosphorylates MEK1.
Active Raf-1 activates MEK1.
MEK1 in turn phosphorylates ERK1.
MEK1 in turn phosphorylates ERK2.
MEK1 in turn activates ERK1.
MEK1 in turn activates ERK2.

Figure 6. An example of simplified sentences in BioC format (left) and the corresponding text file (right) with locations highlighted.

<passage>
  ...
  <sentence>
    <infon key="type">original sentence</infon>
    <text>Active Raf-1 phosphorylates and activates the
      mitogen-activated protein (MAP) kinase/
      extracellular signal-regulated kinase
      kinase 1 (MEK1), which ...</text>
    <annotation id="t2">
      <text>phosphorylates</text>
    </annotation>
  </sentence>
  <sentence>
    <infon key="type">simplified sentence</infon>
    <text>Active Raf-1 phosphorylates MEK1.</text>
    <annotation id="t39">
      <text>phosphorylates</text>
    </annotation>
  </sentence>
  <!-- phosphorylates -->
  <relation id="r2">
    <infon key="type">equ</infon>
    <node refid="t2" role="original"/>
    <node refid="t39" role="simplified"/>
  </relation>
</passage>

[Right panel: the text file shown in Figure 6, with 'phosphorylates' highlighted in the original sentence and in the first simplified sentence.]

Figure 7. An example showing 'equ' (equivalence) relations in iSimp-generated BioC file.

To support interoperable machine-to-machine inter- 
action with other applications, iSimp can be accessed by 
enclosing the BioC file in the POST requests. The iSimp 
Web server will accept and process one sentence per re- 
quest and send back simplification constructs and 
simplified sentences in an all-in-one BioC file. This helps
guarantee the response time and avoids loading overly large
BioC files. To submit sentences in one BioC file (BioC sentence
Document Type Definition (DTD)), users can use the
following format:
http://research.bioinformatics.udel.edu/isimp/biocsentence?biocfile=BioCFileContent.
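
A client could prepare such a request as sketched below with the Python standard library. The endpoint and the 'biocfile' parameter name come from the format given above; the function name is hypothetical, and sending the content as a form-encoded POST body is an assumption about what the server accepts. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://research.bioinformatics.udel.edu/isimp/biocsentence"


def build_isimp_request(bioc_xml: str) -> Request:
    """Wrap one BioC sentence file in a POST request (not sent here)."""
    # Assumption: the server reads the file content from a form field
    # named 'biocfile', as in the query-string format shown in the text.
    body = urlencode({"biocfile": bioc_xml}).encode("utf-8")
    return Request(ENDPOINT, data=body, method="POST")


request = build_isimp_request("<collection/>")
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would submit it, provided the service is reachable and accepts this encoding.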

Results and discussion 

Evaluation on RE system 

To examine the usefulness of iSimp, we considered a very 
simple rule-based RE system. The first relation we focused 
on was the phosphorylation relation between the trigger 
and the theme (substrate) as defined in the GE corpora. We 
used straightforward rules, as shown below, where X is a 
noun phrase in which the protein or protein product ap- 
pears as a headword: 

1. phosphorylate (or, phosphorylates, phosphorylated, 
phosphorylating) X 

2. phosphorylation of X 

3. X phosphorylation 

4. [noun phrase phosphorylated X] 

These rules are able to match straightforward mentions 
of phosphorylation in text. However, they will fail to find 
mentions of phosphorylation in complex sentences, like 
the one shown in (E4). In contrast, the first rule can apply to
the simplified (E5a)-(E5c) and extract <phosphorylates, 
MEK1>, <phosphorylates, ERK1> and <phosphorylates, 
ERK2>. As long as the rules for extraction are precise, the 
simplification step will help improve the recall of the sys- 
tem, without hurting the precision. 
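
The effect can be reproduced with a small regular-expression sketch of rules 1 to 3. It is not the RE system evaluated here: the function name is hypothetical, X is approximated by a given list of protein names rather than a noun phrase with a protein headword, and rule 4 is omitted because it needs noun phrase chunking.

```python
import re


def extract_phosphorylation(sentence, proteins):
    """Return (trigger, theme) pairs matched by simplified rules 1-3."""
    # Longest names first so that a longer name wins over its prefix.
    names = "|".join(re.escape(p) for p in sorted(proteins, key=len, reverse=True))
    flags = re.IGNORECASE
    pairs = []
    # Rule 1: phosphorylate(s|d|ing) X
    rule1 = rf"\b(phosphorylat(?:e|es|ed|ing))\s+(?:the\s+)?({names})\b"
    # Rule 2: phosphorylation of X
    rule2 = rf"\b(phosphorylation)\s+of\s+(?:the\s+)?({names})\b"
    for pattern in (rule1, rule2):
        for m in re.finditer(pattern, sentence, flags):
            pairs.append((m.group(1), m.group(2)))
    # Rule 3: X phosphorylation (theme precedes the trigger)
    rule3 = rf"\b({names})\s+(phosphorylation)\b"
    for m in re.finditer(rule3, sentence, flags):
        pairs.append((m.group(2), m.group(1)))
    return pairs


proteins = {"MEK1", "ERK1", "ERK2"}
original = (
    "Active Raf-2 phosphorylates and activates MEK1, which in turn "
    "phosphorylates and activates the MAP kinases signal regulated "
    "kinases, ERK1 and ERK2."
)
simplified = [
    "Active Raf-2 phosphorylates MEK1.",
    "MEK1 in turn phosphorylates ERK1.",
    "MEK1 in turn phosphorylates ERK2.",
]
from_original = extract_phosphorylation(original, proteins)
from_simplified = [p for s in simplified for p in extract_phosphorylation(s, proteins)]
print(from_original)
print(from_simplified)
```

The rules are unchanged between the two runs; only the input differs, which is the sense in which simplification raises recall without touching precision-oriented patterns.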

We evaluated iSimp in terms of the impact it had on the 
performance of the RE system. Thus, we compared the re- 
sults obtained by the RE system when using versus not 
using iSimp. The BioC XML format and schema described 
in the previous section were used to transfer the original 
data to iSimp and the RE system as well as to transfer the 
enhanced data from iSimp to the RE system. Besides add- 
ing and removing iSimp from the pipeline, no additional 
changes were made to the steps involved in the pipeline. 
This not only shows the interoperability of iSimp, but also
indicates that our proposed mechanism of using the BioC
framework works as expected.

We tested this basic RE system on the BioNLP-ST 2011 
GE task training corpus (14). Precision/Recall/F-value with- 
out simplification were 97.32/78.38/86.83 versus 97.42/ 
81.62/88.82 with simplification. These results show that 
with the help of iSimp, the recall gap of 21.62 was reduced 
by 15% to 18.38, without introducing precision errors. In 
our previous and ongoing work (15), we have observed 
similar improvements in recall for various RE tasks. 
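
The reported figures can be checked with a few lines of arithmetic. The sketch below recomputes the F-value as the harmonic mean of precision and recall and derives the recall gap and its relative reduction; the variable names are chosen for illustration only.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


p_base, r_base = 97.32, 78.38   # without simplification
p_simp, r_simp = 97.42, 81.62   # with simplification

f_base = f_value(p_base, r_base)
f_simp = f_value(p_simp, r_simp)

recall_gain = r_simp - r_base            # absolute gain in percentage points
gap_before = 100 - r_base                # recall gap without simplification
gap_after = 100 - r_simp                 # recall gap with simplification
relative_reduction = recall_gain / gap_before

print(round(f_base, 2), round(f_simp, 2))
print(round(recall_gain, 2), round(gap_before, 2), round(gap_after, 2))
print(round(relative_reduction * 100), "% of the recall gap closed")
```

The 3.2-point gain is therefore an absolute difference in recall, while the 15% figure is relative to the recall gap of the baseline.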



In this exercise, we did not include agents because the 
GE corpora did not consider agents. But, because the 
above rules are most likely to be affected by noun phrase 
coordination, we believe simplification will benefit the 
agent extraction as well. 

This exercise also illustrates the ability of sentence sim- 
plification to keep rules simple and yet achieve good 
results. Because patterns for simplification and RE are or- 
thogonal, we do not need to multiply rules to consider all 
their combinations. An alternative way, as shown in the 
above example, is to treat sentence simplification as an in- 
dependent task, and not for a particular RE. This way, we 
can focus on simple rules only. Sentence simplification is 
then applied to increase the recall of the original system.

Simplification-annotated corpora in the BioC 
format 

We provide a corpus marked with simplification constructs,
using the BioC format
(http://research.bioinformatics.udel.edu/isimp/corpus.html).
This corpus can be
used by others to evaluate the performance of iSimp or 
other sentence simplifiers. The corpus consists of 130 
Medline abstracts mentioning proteins and genes, with a 
total of 1199 sentences. The corpus contains three BioC 
files: (i) Medline abstracts as raw text, (ii) sentences that
are split using the OpenNLP sentence detector and (iii) an- 
notations of simplification constructs at the sentence level. 
Key files are also provided with additional information 
that describes the meaning of tags used in the BioC files 
and the annotation schema. The corpus uses the same 
DTD provided by BioC for validation. 

Additionally, we have converted the BioNLP-ST 2011 
GE corpus to the BioC format for our evaluation purposes, 
and this corpus can also be downloaded from the link 
given above. 

Conversion script 

We provide a script to convert the BioNLP-ST corpus to
the BioC format (https://bitbucket.org/udbiotmgroup/bionlp2bioc).
The original text files (.txt) are split based on
'newline', and the various parts are stored into passage
elements. Entities (in .a1 files) and event triggers (in .a2 files)
are stored into appropriate passages based on their
positions in the text files. Target annotations (in .a2 files),
including events, relations, event modifications and 
equivalences, are recorded at the document level. If the an- 
notation is marked by more than one continuous span of 
characters, the script creates several location elements. 
This also shows the generalizability of the BioC format, 
which allows multi-segmented annotations. 
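
The handling of multi-segment annotations can be sketched as follows. This is not the conversion script itself: the function name is hypothetical, and the assumption that discontinuous spans are written as semicolon-separated 'start end' pairs follows the brat-style standoff convention rather than anything stated in the text.

```python
import xml.etree.ElementTree as ET


def textbound_to_annotation(line):
    """Convert one standoff text-bound line into a BioC <annotation>.

    Expected shape (tab-separated): id, 'Type start end[;start end...]', text.
    Each span becomes its own <location offset=... length=...> element.
    """
    ann_id, type_and_spans, text = line.rstrip("\n").split("\t")
    ann_type, span_str = type_and_spans.split(" ", 1)

    annotation = ET.Element("annotation", id=ann_id)
    ET.SubElement(annotation, "infon", key="type").text = ann_type
    for span in span_str.split(";"):
        start, end = (int(x) for x in span.split())
        ET.SubElement(
            annotation, "location", offset=str(start), length=str(end - start)
        )
    ET.SubElement(annotation, "text").text = text
    return annotation


# Hypothetical input lines: one continuous and one two-segment annotation.
single = textbound_to_annotation("T1\tProtein 7 12\tRaf-1")
multi = textbound_to_annotation("T2\tProtein 0 5;10 15\tfirst later")
print(ET.tostring(multi, encoding="unicode"))
```

Standoff spans give start and end positions, whereas a BioC location stores an offset and a length, so the length is computed as end minus start.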
Conclusion

In this study, we enhanced our sentence simplifier system,
iSimp, to fully adopt the BioC format. We defined a unique
BioC tag set for annotating simplification results and proposed
a schema, which allows simplified sentences to be
included in the BioC annotation file and be treated as part
of the original collection. The proposed schema is different
from the standard schema in that it can include words that
are not part of the original text.

To illustrate the usefulness of iSimp with BioC, we
examined its impact on a basic RE system. Evaluation on
the BioNLP-ST 2011 GE task training corpus showed that,
with sentence simplification provided by iSimp, the recall
increased by 3.2 percentage points, which corresponds to a
15% reduction in recall error, without introducing precision errors.
The GE corpora, converted into the BioC format, were made
publicly available together with the conversion script.
Additionally, corpora we had previously developed for
evaluating the simplification performance of iSimp were made
available in the BioC format, which may be used as public
benchmarking corpora.

The corpora and the online demo of iSimp, using the
BioC format, are available at
http://research.bioinformatics.udel.edu/isimp/.